Main goals of the session

  1. Learn how to (1) simulate genetic drift, (2) to infer the effect of genetic drift on the temporal dynamic of genetic variation} by simulation and and (3) to graph the the impact of mutation alone on allele frequency change

  2. To create R scripts from scratch to solve the previous objectives

1. R scripts

Scripts which are going to be created in this session:

Introduction: the Wright-Fisher model to describe the behaviour of genetic drift

Genetic drift is the change in the frequency of an existing allele in a population due to random sampling of organisms.

The Wright-Fisher model –named after early pioneers of theoretical population genetics, Sewall Wright and Ronald Fisher– describes the sampling of alleles (A and a) in a population with no selection, no mutation, no migration, non-overlapping generation times and random mating in a constant size population.

1. Random sampling in R

R has a convenient function for handling sample selection: sample(). This function addresses common cases considered by the Wright-Fisher model, such as:

  • Picking from a finite set of values
  • Sampling with replacement

The default setting of sample() is randomly sorting the values of a vector. These are returned to the user in random order. For example:

# Random sampling of a vector
sample(c(1:10))
##  [1]  7  9  8  2 10  4  5  1  3  6

We get the same vector, but shuffled into a different random order. But, what if a value can be selected multiple times? This is known as sampling with replacement. The sample() function supports this using an additional parameter: replace. Replace can be T (TRUE) or F (FALSE). By default, it assumes no replacement. For example:

# Random sampling of a vector with replacement
sample(c(1:10), replace = T)
##  [1] 9 2 4 8 2 2 9 4 8 8

As you can see, we get the vector of the same size (10 values), in random order, and certain values are repeatedly picked.

We can add the size parameter to return only a few values. For example, the following code will pick three values, with replacement:

# Random sampling three elements of a vector with replacement
sample(c(1:10), size = 3, replace = T)
## [1] 9 9 3

If you run each command several times you’ll get different results every time, and not exactly the ones that are shown here. You may be wondering: how can I make reproducible results? You can use the set.seed(x) function: it allows you to produce the same sample again and again by setting a seed (x, a number). For example:

# Random sampling three elements of a vector with replacement
set.seed(4)
sample(c(1:10), size = 3, replace = T)
## [1] 8 3 3

The result of this sample will always be 8 3 3 when the seed is 4.

The prior examples assume that we are selecting values at random for a vector. But, what if there are some values which are more probable to be picked than others? sample() allows to adjust the probability of each item being selected, using the prob parameter.

For example, imagine that we have developed a test to detect COVID-19 from saliva samples. Our quality isn’t great, so there’s a 25% change of developing a defective test. We test 6 people. We can simulate this scenario using the following code:

# Random sampling six elements of a vector with replacement with probabilities
sample(c('Good', 'Defective'), size = 6, replace = T, prob = c(0.75, 0.25))
## [1] "Defective" "Good"      "Good"      "Defective" "Defective" "Good"

As we can see, the error rate is 33%, higher than we simulated, but we can get even higher error rates… maybe we should consider quitting developing COVID-19 tests!

1.1 DNA sequence generator

Let’s try the sample() function on more biological stuff: for example, let’s simulate a DNA random sequence! Let’s simulate a DNA sequence of a length 100 nucleotides, with a GC% content of 40%.

# Random DNA sequence generator taking into account GC content
nucleotides = c("A", "T", "C", "G") 
composition = c(0.3, 0.3, 0.2, 0.2)
lengthSeq = 100
seq100 <- sample(nucleotides, size = lengthSeq, prob = composition, replace = T)
length(seq100)
## [1] 100
table(seq100)
## seq100
##  A  C  G  T 
## 25 17 28 30
table(seq100)/lengthSeq
## seq100
##    A    C    G    T 
## 0.25 0.17 0.28 0.30

The table() function is very useful in this situation because it performs categorical tabulation of the data, with the variable and its frequency. We can see how the simulated DNA sequence, has a nucleotide composition according to the GC% we set.

1.2 Random sampling of alleles

Finally, let’s simulate the random sampling of alleles in a population of a given size, to get closer to our objective: simulate the Wright-Fisher model.

For example, let’s try the random sample of 20 alleles A and a, with a frequency p = 0.8 and q = 0.2, respectively, in a population of a given effective size Ne of 10:

# 20 alleles
# 10 Ne
# allele A: p = 0.8
# allele a q = 0.2
alleles = c("A", "a") 
freq = c(0.8, 0.2)
Ne = 10
sample_alleles = sample(alleles, size = 2*Ne, prob = freq, replace = T)
table(sample_alleles)
## sample_alleles
##  a  A 
##  3 17
table(sample_alleles)/(2*Ne)
## sample_alleles
##    a    A 
## 0.15 0.85

If we divide the allele count by 2\(\times\)Ne, we get the allele frequency, in this case, slightly different from the initial frequency value.

Script 1: Simulation of the genetic drift (a linear random walk) using the sample() function

The function will use the initial allele frequency of allele A (p_0), the effective population size (Ne) and the number of generations to be simulated (tgen).

# Genetic drift
# Simulation of random drift trajectory of allele frequency for one locus with two alleles 
# Ne = effective population size; tgen = number of generation simulated; p_0 = initial frequency of allele A)
# A single trajectory (a linear random walk)

genetic_drift <- function(p_0, Ne, tgen) {
  alleles <- c(1, 0) # 1 A and 0 a (numerical for estimating frequency directly)
  pvector <- c() # Store p allele frequency values over generations
  pvector[1] <- p_0 # First value is initial p frequency
  p_ = p_0
  for (i in 2:(tgen+1)) {
    p_ = sum(sample(alleles, size = 2*Ne, prob = c(p_, 1-p_), replace = T))/(2*Ne) ;
    pvector[i] = p_ # Add new frequency 
  }
  plot(pvector, ylim = c(0, 1),
       col="coral2", type = "l",lwd = 3,
       main = paste("Genetic drift - Population size = ", Ne), 
       xlab = "Generation", ylab = "Frequency of p")
} 
# Call the function
genetic_drift(0.5, 100, 100)

The result of this function is a plot, describing the trajectory of an allele over 100 generations.

In this simulation, the locus stays polymorphic after 100 generations. But can be the case that the allele gets fixed or lost due to genetic drift.

2. The binomial distribution describes the sampling process of genetic drift

Genetic drift is a stochastic process, individual outcomes are dictated by chance, but a large number of outcomes can be modelled as a probability distribution.

The binomial formula describes the probability distribution of i successes in N trails, where the probability of success is p: similar to sampling alleles from a population. Consider a population of A and a alleles. If the frequencies of the A1 and A2 alleles in the population are p and q, in a 2N gametes sample, the probabilities of the different results are given by the development of (p + q)2N.

Let x be the probability of obtaining i gametes A1 and 2N-i gametes A2:

\[ p(i=A) = \binom{2N}{i} p^iq^{2N-i}\]

where \(\binom{2N}{i} = \frac{(2N)!}{i!(2N-i)!}\)

and \(p^i\) is the probability of drawing \(i\) alleles if the frequency is \(p\), and \(q^{2N-i}\) is the probability of the remainder of the drawn alleles being B given the frequency \(q\). \(\binom{2N}{i}\) represents the probability of the number of successes, \(i\), given \(2N\) trials.

This process describes the binomial sampling of alleles each generation. For example, what is the probability of drawing 2 A alleles from in 4 trials (2N = 4), given that p = 0.5?

\[ \binom{4}{2}=\frac{4!}{2!(4-2)!} = \frac{24}{4} = 6 \]

2N actually refers to the number of draws you are making, so if you make 4 draws it is written as \(\binom{4}{2}\) for example. The notation here is 2N because in a Wright-Fisher model of drift you are usually drawing two alleles from each individual, thus you make twice as many draws as there are individuals, hence 2N. Adding this into the equation we get:

\[P_{(i=A)}=6 \times p^{i}q^{2N-i} = 6 \times 0.5^{2}0.5^{2}= 0.375\]

2.1 Binomial distribution in R

To perform a random binomial simulation in R, we can use the rbinom() function. This function needs three arguments: the number of observations you want to get (n), the number of trials per observation (size) and the probability of success of each trial (prob).

Coming back to our example of COVID-19 test. We make 150 tests per day and the defective ones must be repaired. We know that there is a 5% of error rate. To estimate the number of tests that we need to fix each day in a week is:

# Binomial simulation in r
rbinom(n = 7, size = 150, prob = 0.05)
## [1]  8  7  7  9  4 14  9

Everyday of the week we get a different number of test to repair, according to the error rate.

2.2 Random sampling of alleles using the binomial distribution

To demonstrate a simple case of drift, we’ll use the previous example. We have a population of 2Ne = 4 individuals and we will focus on a single locus A with two alleles, A1 and A2. We will assume both alleles are equally common in the population (p = 0.5 and q = 0.5). To achieve this in R:

N <- 4 # set population size
p <- 0.5 # set allele frequency
# Binomial sampling to get the new frequency of the A1 in the next generation 
rbinom(n = 1, size = 2*N, prob = p)/(2*N)
## [1] 0.75

You will see though that the new frequency is likely different to the initial frequency solely because of random sampling.

Script 2: Simulation of the genetic drift (a linear random walk) using the random binomial distribution function

We are going to modify function 1 to implement a random binomial distribution function instead of a simply random sample function.

# Genetic drift
# Simulation of random drift trajectory of allele frequency for one
# locus with two alleles 
# Ne = effective population size; tgen = number of generation simulated;
# p_0 = initial frequency allele A)
# A single trajectory (a linear random walk)
# using random binomial distribution function

genetic_drift <- function(p_0, Ne, tgen) {
pvector <- c()
pvector[1] <- p_0
p_ = p_0

for(i in 2:(tgen+1)) {
    p_ = rbinom(1,2*Ne,p_)/(2*Ne) # Random binomial distribution
    pvector[i] <- p_
}
plot(pvector, ylim = c(0,1), 
    col="coral2", type = "l", lwd = 3,
    main = paste("Genetic drift - Population size = ",Ne), xlab="Generations" , ylab = "Relative frequency")
}
# Call the function
genetic_drift(0.7, 50, 100)

3. Wright-Fisher model for n replicates (replicated random walks)

Finally, we are going to simulate the trajectory of multiple replicates instead of just one.

genetic_drift <- function(p_0, Ne, tgen, nrepl) {
  # Empty plot
  plot(0,0, xlim=c(0,tgen+1), ylim = c(0,1), 
      col="coral2", type = "l", lwd = 3,
      main = paste("Genetic drift - Population size = ",Ne), xlab="Generations",
      ylab = "Relative frequency")
  # For to iterate over replicates
  for (n in 1:nrepl) {
    pvector <- c()
    pvector[1] <- p_0
    p_ = p_0
    for(i in 2:(tgen+1)) {
      p_ = rbinom(1,2*Ne,p_)/(2*Ne); 
      pvector[i] <- p_
    }
    # Draw a line for each replicate
    lines(pvector, type = "l", col = n, lwd = 2)
    # print(paste(p_))
  }
}
# Call the function
genetic_drift(0.3, 100, 100, 10)

This graph represents the trajectory of the allele frequency in 10 replicates.

Exercise 1 | Effect of the genetic drift in a locus

We will first analyze the stochastic effect of genetic drift in the absence of selection. Compare the proportion of populations (i) still polymorphic, (ii) that fix the A allele (p = 1), and (iii) that lose the A allele (p = 0), in populations of different sizes (500, 100 and 10) in 100 generations, starting at allele frequencies A p0 = 0.5 and p0 = 0.9. Make at least 20 replicates of each simulation and record the results in a table. How do the results differ according to the initial frequencies of allele A and according to population size? Give an explanation to your results.

Answer:
Simulation A is lost (p=0) Polymorphic A is fixed (p=1)
p0=0.5, N=500 0 20 0
p0=0.5, N=100 1 18 1
p0=0.5, N=10 8 0 12
p0=0.9, N=500 0 19 1
p0=0.9, N=100 0 8 12
p0=0.9, N=10 6 0 14
# (p_0, Ne, tgen, nrepl)

genetic_drift(0.5, 500, 100, 20) # All of them are polymorphic

genetic_drift(0.5, 100, 100, 20) # There are a few where the allele is lost or fixed, but mainly the alleles are polymorphic

genetic_drift(0.5, 10, 100, 20) # All the alleles are fixed or lost (there are so little polymorphic

genetic_drift(0.9, 500, 100, 20) # All the alleles are have really high frequencies, few of them are fixed

genetic_drift(0.9, 100, 100, 20) # Some of the alleles are fixed, a few remain polymorphic

genetic_drift(0.9, 10, 100, 20) # All the alleles are fixed or lost, there are no polymorphic alleles

# When increasing the Ne, the number of alleles that are fixed/lost decreases. We have less alleles fixed/lost if the effective population is high.
# When starting a higher p_0 the frequency of the allele is higher, having higher number of fixations and less losses.

Exercise 2 | Allele frequency differences

Consider the whole set of replicates of genetic drift for a given initial conditions. How do average allele frequencies and allele frequency differs among replicates change over generations? Add on the previous script the code to graph the average (mean) allele frequency. When is the maximum difference in allele frequencies among replicates reached? Which is its value?

genetic_drift <- function(p_0, Ne, tgen, nrepl) {
  
  plot(0,0, xlim=c(0,tgen+1), ylim = c(0,1), 
       col="coral2", type = "l", lwd = 3,
       main = paste("Genetic Drift - Population size = ",Ne), xlab="Generations" , ylab = "Relative frequency")
  meanpvector <- 0*c(1:(tgen+1))
  meanpvector[1] <- p_0
  for (n in 1:nrepl) {
    pvector <- c()
    pvector[1] <- p_0
    p_ = p_0
    for(i in 2:(tgen+1)) {
      p_ = rbinom(1,2*Ne,p_)/(2*Ne); 
      pvector[i] <- p_
      meanpvector[i]=meanpvector[i]+pvector[i]/nrepl
    }
    lines(pvector, type = "l", col = n, lwd = 2)
  }
  # Add mean
  lines(meanpvector, type = "l",  col = "coral4",  lwd = 6)
  legend("topleft", legend=paste("Mean = ", round(mean(meanpvector),3)),
       col=c("coral4"), lty=1, lwd=3, cex=0.8)
}
genetic_drift(0.5, 500, 100, 10)

Answer:
Simulation Mean Allele difference
p0=0.5, N=500 0.482 All polymorphic
p0=0.2, N=500 0.196 All polymorphic
p0=0.9, N=500 0.89 1 fixed, 9 polymorphic
p0=0.5, N=100 0.517 1 fixed, 1 lost, 8 polymorphic
p0=0.2, N=100 0.186 4 lost, 6 polymorphic
p0=0.9, N=100 0.82 4 fixed, 6 polymorphic
p0=0.5, N=10 0.578 4 lost, 5 fixed, 1 polymorphic
p0=0.2, N=10 0.21 8 lost, 2 fixed
p0=0.9, N=10 0.996 All fixed
genetic_drift(0.5, 500, 100, 10)

genetic_drift(0.2, 500, 100, 10)

genetic_drift(0.9, 500, 100, 10)

genetic_drift(0.5, 100, 100, 10)

genetic_drift(0.2, 100, 100, 10)

genetic_drift(0.9, 100, 100, 10)

genetic_drift(0.5, 10, 100, 10)

genetic_drift(0.2, 10, 100, 10)

genetic_drift(0.9, 10, 100, 10)

# When the p_0 has higher values we get a higher number of fixations, and as an overall the frequencies are larger --> mean value > 0.5
# When the p_0 has lower values we get a higher number of losses, and the frequencies are smaller --> mean value < 0.5
# When the p_0 is 0.5 we get the same losses than fixations (or same number of frequencies are above/below 0.5 as an overall) --> mean value around 0.5

# If we increase Ne we get more polymorphic alleles and fewer fixations/losses.
# When decreasing the Ne we'll get higher number of fixations/losses depending on the initial frequencies.

Exercise 3 | Simulating a bottleneck

Simulate repeatedly a bottleneck at a given t generation and compares its impact with the same population without bottleneck. Initial conditions: p0=0.5, N=1000, 100 generations. At generation 50 the population decreases from 1000 to 10, and at generation 51 the population increases again to 1000.

Hint: To simulate a bottleneck, use a if-else statement.

Answer:
Simulation A is lost (p=0) Polymorphic A is fixed (p=1)
p0=0.5, N=1000, with bottleneck 0 10 0
p0=0.5, N=1000, without bottleneck 0 10 0
genetic_drift_bottleneck <- function(p_0, Ne, tgen, nrepl) {
  
  plot(0,0, xlim=c(0,tgen+1), ylim = c(0,1), 
       col="coral2", type = "l", lwd = 3,
       main = paste("Genetic Drift - Population size (bottleneck) = ",Ne), xlab="Generations" , ylab = "Relative frequency")
  meanpvector <- 0*c(1:(tgen+1))
  meanpvector[1] <- p_0
  for (n in 1:nrepl) {

    pvector <- c()
    pvector[1] <- p_0
    p_ = p_0
    for(i in 2:(tgen+1)) {
      if (i == 50){
        Ne = 10
      }else{
        Ne = 1000
      }
      p_ = rbinom(1,2*Ne,p_)/(2*Ne); 
      pvector[i] <- p_
      meanpvector[i]=meanpvector[i]+pvector[i]/nrepl
    }
    lines(pvector, type = "l", col = n, lwd = 2)
    # print(paste(p_))
  }
  # Add mean
  lines(meanpvector, type = "l",  col = "coral4",  lwd = 6)
  legend("topleft", legend=paste("Mean = ", round(mean(meanpvector),3)),
       col=c("coral4"), lty=1, lwd=3, cex=0.8)
}

genetic_drift(0.5, 1000, 100, 10)

genetic_drift_bottleneck(0.5, 1000, 100, 10)

# When applying the bottleneck in the generation 50 we can see that there is a drastic change in the frequencies of the alleles, which occurs to the sudden reduction in population size during that generation. As it can be seen, it doesn't happen if the bottleneck is not applied.

Exercise 4 | Population size vs. Effective population size

Using the previous initial conditions: which population size better predict the bottleneck effects? the mean population size, or the harmonic mean?

Here you have a simple example on how to calculate the normal mean and the harmonic mean:

popSizes = c(500, 500, 500, 10, 500)
mean(popSizes) # Mean
## [1] 402
1/mean(1/popSizes) # Harmonic mean
## [1] 46.2963

The effective population size Ne is approximately the number of individuals that contributes to the next generation. This is the size of a Wright-Fisher population that would produce the same magnitude of genetic drift (in this case a bottleneck).

Which scenario is more similar to what you observed when there was a bottleneck? Simulate an scenario without a bottleneck, but considering the N as the mean or the harmonic mean.

Answer:
Simulation A is lost (p=0) Polymorphic A is fixed (p=1)
p0=0.5, N=mean 0 10 0
p0=0.5, N=harmonic mean 0 10 0
v <- c(10, rep(1000,99))
Ne_mean <- round(mean(v), 0)
Ne_harm <- round(1/mean(1/v), 0)

genetic_drift(0.5, Ne_mean, 100, 10)

genetic_drift(0.5, Ne_harm, 100, 10)

# The population created using the harmonic mean predicts the bottleneck effects more accurately. This is because when an extremely low value is included in the harmonic mean calculation, it can have a significant impact on the final mean value, unlike what would occur when calculating the mean.

4. Mutation as a population genetics force

A mutation acts as a force that restores genetic variation eliminated by natural selection or that have been lost due to genetic drift. Mutations are the fundamental raw material of evolution.

There are two types of mutation relative to population genetics:

In contrast to non-recurrent mutations, recurrent mutations will change gene (allele) frequencies, as a function of the rate at which such mutations arise (minus the frequency of reverse or backward mutation). The spontaneous mutation rates are very low, typically between 10-7 and 10-9 for point mutations in higher organisms.

A recurrent mutation that changes the allele A to a, has a mutation rate of \(\mu\):

Generation 0: p0 = 1; q0 = 0

Generation 1: q1 = \(\mu\)p0

Generation 2: q2 = \(\mu\)p1 + q1

Generation n: qn = \(\mu\)pn-1 + qn-1

\[ q_n = 1-(1-\mu)^n \times p_0 \]

\[ p_n = (1-\mu)^n \times p_0 \]

Let’s consider a “fight” between forward and backward mutation. Forward mutation changes the A allele to the a allele at a rate (\(\mu\)); backward mutation changes a to A at a rate (\(\nu\)). We can express the frequency (p) of the A allele in the next generation (pt+1) in terms of these opposing forward and reverse mutations:

\[ q_n = q_{n-1} + {\mu}p_{n-1}-{\nu}q_{n-1}\] \[ {\Delta}q = q_n - q_{n-1} = {\mu}p_{n-1}-{\nu}q_{n-1} \] \[ {\Delta}q = 0 \] \[ \hat{q} = \frac{\mu}{\mu + \nu} \] \[ \hat{p} = \frac{\nu}{\mu + \nu} \]

Under this model allele frequencies will change each generation, eventually approaching and equilibrium value.

Script 3: Simulation of mutation dynamics

The following function plots the change in allele frequency q over generations considering a recurrent mutation with a \(\mu\) of 0.0001.

# Recurrent
mut_dyn <- function(p0, mu, tgen) {
    plot(0, 0, xlim = c(0,tgen), ylim = c(0,1), main = 
    paste("Recurrent mutation – mu = ",mu),  xlab="Generation", ylab="q",
    type="l")
    freqvector <- c() 
    freqvector[1] <- 1-p0
    for (i in 1:tgen){
        freqvector[i+1]<-1 - p0*(1- mu)^i
    }
    lines(freqvector, type = "l", lwd=3, col="coral4")
}
mut_dyn(1,0.0001,10000) 

The following function plots the change in allele frequency q over generations considering a recurrent mutation with a \(\mu\) of 0.0001 and a retromutation with a \(\nu\) of 0.00001.

# Recurrent retromutation
mut_dyn_retro <- function(p0,mu,nu,tgen) {
    plot(0,0,xlim = c(0,tgen),ylim = c(0,1),type = "l", main = 
    paste("Mutation and retromutaion – mu = ",mu,"nu =",nu), xlab="Generation",
    ylab="q")
    freqvector <- c() 
    freqvector[1]<- 1-p0
    for (i in 1:tgen) {
        freqvector[i+1]<-freqvector[i] + mu*(1-freqvector[i])- nu*freqvector[i]
    }
  lines(freqvector, type = "l",lwd=3, col="deepskyblue4")
  return(freqvector)
}
x <- mut_dyn_retro(1,0.0001,0.00001,10000)

Exercise 5 | Exercise on mutation

In a population, the D \(\rightarrow\) d mutation rate is 4 \(\times\) 10-6. If p=0.8 today, what will p be after 50,000 generations? If there is a retromutation with rate 6 \(\times\) 10-7, what will be the allele frequency in the equilibrium?

Answer:
p0 <- 0.8
mu <- 0.000004
nu <- 0.0000006
tgen <- 50000

fr <- mut_dyn_retro(p0, mu, nu, tgen)

fr[length(fr)]
## [1] 0.3375734
# The allele frequency in equilibrium is 0.33

Deliver the R Markdown in Aul@-ESCI

Deliver this document in Aul@-ESCI with your answers

Deadline: 22 April 2024

Doubts? and .